Recently, many efficient Transformers have been proposed to reduce the quadratic computational complexity that softmax attention imposes on the standard Transformer. However, most of them simply swap softmax attention for an efficient attention mechanism without considering an architecture customized for efficient attention. In this paper, we argue that the hand-crafted vanilla Transformer architecture, designed around softmax attention, may not be well suited to efficient Transformers. To address this issue, we propose a new framework that finds the optimal architecture for efficient Transformers with neural architecture search (NAS) techniques. The proposed method is validated on popular machine translation and image classification tasks. We observe that, compared with the standard Transformer, the optimal architecture for efficient Transformers has reduced computation but lower overall accuracy. This suggests that softmax attention and efficient attention each have their own strengths, but neither can balance accuracy and efficiency at the same time. This motivates us to mix the two types of attention to reduce the performance imbalance. Besides the search spaces commonly used in existing NAS-Transformer methods, we propose a new search space that allows the NAS algorithm to automatically search the attention variants together with the architecture. Extensive experiments on WMT' En-De and CIFAR-10 show that our searched architecture maintains accuracy comparable to the standard Transformer with notably improved computational efficiency.
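Though the abstract gives no implementation details, a minimal sketch of the idea of letting a search algorithm choose, per layer, between softmax attention and an efficient attention variant might look like the following; the layer count, the linear-attention feature map, and the random sampling scheme are all assumptions for illustration, not the paper's method.

```python
# Toy sketch (not the paper's method): a per-layer search space that mixes
# softmax attention with a simple linear-attention variant, as a NAS algorithm
# might sample candidate architectures. All names and choices here are assumed.
import random
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention, O(n^2) in sequence length.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # A common efficient approximation using the feature map phi(x) = elu(x) + 1,
    # giving O(n) cost; it stands in generically for "efficient attention" here.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                  # (d, d) summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6  # normalizer
    return (q @ kv) / z

ATTENTION_CHOICES = {"softmax": softmax_attention, "linear": linear_attention}

def sample_architecture(num_layers):
    # One candidate architecture = a choice of attention variant per layer.
    return [random.choice(list(ATTENTION_CHOICES)) for _ in range(num_layers)]

if __name__ == "__main__":
    arch = sample_architecture(num_layers=6)
    q = k = v = torch.randn(2, 16, 32)        # (batch, sequence, dim)
    out = q
    for layer_choice in arch:
        out = ATTENTION_CHOICES[layer_choice](out, k, v)
    print(arch, out.shape)
```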
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of Hessian when SAM is applied.
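For reference, a standard way to write the SAM objective and the two approximation steps discussed above, using the notation of the original SAM formulation (which may differ from this paper's):

```latex
% SAM objective (full-batch form) and its two approximation steps.
% Notation follows the common SAM formulation; this paper's notation may differ.
\min_{w}\ \max_{\lVert \epsilon \rVert_2 \le \rho} L(w + \epsilon)
% (i) approximate the inner maximization by a first-order Taylor expansion:
\hat{\epsilon}(w) \approx \rho\, \frac{\nabla L(w)}{\lVert \nabla L(w) \rVert_2}
% (ii) drop the Jacobian d\hat{\epsilon}/dw when differentiating, giving the update:
w \leftarrow w - \eta\, \nabla L\bigl(w + \hat{\epsilon}(w)\bigr)
```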
Training models that are robust to data-domain shift has attracted interest in both academia and industry. Question answering with language models is one of the canonical problems in natural language processing (NLP) research and has achieved great success with the advent of large Transformer models. However, existing methods mostly assume that data during training and testing come from the same distribution, which is impractical and unrealistic in the wild. In this paper, we explore adversarial training methods for learning domain-invariant features so that language models can generalize well to out-of-domain datasets. We also examine several other approaches to improving model performance, including data augmentation by paraphrasing sentences, answer-end prediction conditioned on the start word, and a carefully designed annealing function. Our initial results show that, combining these methods, we obtain a $15.2\%$ improvement in EM score and a $5.6\%$ boost in F1 score over the baseline. We also dissect the model outputs by projecting them into a lower-dimensional space and visualizing the model's hidden states, and find that our specific adversarial training method indeed encourages the model to learn domain-invariant embeddings and brings them closer together in the multi-dimensional space.
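The abstract does not specify the exact adversarial setup; one common instantiation of adversarial training for domain-invariant features is a domain discriminator trained through a gradient-reversal layer (DANN-style). The sketch below assumes that style, with toy encoder/head sizes and a made-up loss weight, and is not necessarily the method described above.

```python
# A DANN-style sketch of adversarial training for domain-invariant features.
# Encoder/discriminator sizes and the reversal weight lambda are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU())   # stand-in for an LM encoder
qa_head = nn.Linear(256, 2)                               # toy QA head (2 classes)
domain_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

opt = torch.optim.Adam(list(encoder.parameters()) + list(qa_head.parameters())
                       + list(domain_clf.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

features = torch.randn(8, 768)                  # dummy pooled LM features
qa_labels = torch.randint(0, 2, (8,))
domain_labels = torch.randint(0, 2, (8,))       # in-domain vs. out-of-domain

h = encoder(features)
qa_loss = ce(qa_head(h), qa_labels)
# The domain classifier sees reversed gradients, so the encoder learns to fool it,
# which pushes the features toward domain invariance.
dom_loss = ce(domain_clf(GradReverse.apply(h, 0.1)), domain_labels)
(qa_loss + dom_loss).backward()
opt.step()
```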
As powerful as fine-grained visual classification (FGVC) is, answering your query with a bird name such as "Whip-poor-will" or "Mallard" may not make much sense. This, however commonly accepted in the literature, highlights a fundamental question at the interface of AI and humans: what constitutes transferable knowledge for humans to learn from AI? This paper aims to answer this question using FGVC as a test bed. Specifically, we envisage a scenario in which a trained FGVC model (the AI expert) serves as a knowledge provider to help ordinary people (you and me) become better domain experts ourselves, i.e. people able to tell "Whip-poor-will" from "Mallard". Figure 1 lays out our approach to answering this question. Assuming an AI expert trained with expert human labels, we ask (i) what is the best transferable knowledge we can extract from the AI, and (ii) what is the most practical means of measuring the gain in expertise given such knowledge? For the former, we propose to represent the knowledge as highly discriminative visual regions that are exclusive to the expert. To this end, we devise a multi-stage learning framework that starts by modeling the visual attention of domain experts and novices, before discriminatively distilling their differences to obtain the expert-exclusive knowledge. For the latter, we simulate the evaluation process as a book guide, so as to best fit the learning practices and habits of humans. A comprehensive human study of 15,000 trials shows that our method is able to consistently improve people of divergent bird expertise at recognizing once-unrecognizable birds. Interestingly, our approach also leads to improved conventional FGVC performance when the extracted knowledge is used as a means to achieve discriminative localization. Code is available at: https://github.com/pris-cv/making-a-bird-ai-expert-work-for-you-and-me
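As a purely illustrative toy (not the paper's multi-stage framework), one way to read "discriminatively distilling the difference between expert and novice attention" is to keep the spatial regions where an expert model attends far more strongly than a novice model; the shapes and the top-k heuristic below are assumptions.

```python
# Toy sketch: extract "expert-exclusive" regions as the cells where expert
# attention exceeds novice attention the most. Purely illustrative.
import torch
import torch.nn.functional as F

def attention_difference_regions(expert_attn, novice_attn, top_k=5):
    """Keep the spatial cells where the expert attends far more than the novice."""
    diff = (expert_attn - novice_attn).clamp(min=0)     # expert-specific attention
    flat = diff.flatten(start_dim=1)
    topk = flat.topk(top_k, dim=1).indices              # most discriminative cells
    mask = torch.zeros_like(flat).scatter_(1, topk, 1.0)
    return mask.view_as(diff)

expert = F.softmax(torch.randn(2, 14 * 14), dim=1).view(2, 14, 14)
novice = F.softmax(torch.randn(2, 14 * 14), dim=1).view(2, 14, 14)
regions = attention_difference_regions(expert, novice)
print(regions.sum(dim=(1, 2)))   # 5 highlighted cells per image
```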
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of these datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer-learning performance on novel tasks. The design of the CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting previously learned classes.
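A minimal sketch of the "CLIP-driven" idea, i.e. conditioning a segmentation head on CLIP text embeddings of class names; the prompt example, projection sizes, and simple per-class scoring head are illustrative assumptions rather than the Universal Model's exact design.

```python
# Sketch: use CLIP text embeddings of class names to produce per-class
# segmentation logits. Sizes and the scoring head are assumptions.
import torch
import torch.nn as nn

class ClipDrivenHead(nn.Module):
    def __init__(self, text_dim=512, feat_dim=64):
        super().__init__()
        # Map each class's CLIP text embedding to a feat_dim "kernel"
        # that scores that class on the image feature map.
        self.to_kernel = nn.Linear(text_dim, feat_dim)

    def forward(self, feat, class_text_emb):
        # feat: (B, C, H, W); class_text_emb: (K, text_dim)
        kernels = self.to_kernel(class_text_emb)              # (K, C)
        return torch.einsum("bchw,kc->bkhw", feat, kernels)   # per-class logits

# In practice the text embeddings would come from a frozen CLIP text encoder, e.g.
#   emb = clip_model.encode_text(clip.tokenize(["a computerized tomography of a liver", ...]))
# Here random tensors are used so the sketch runs standalone.
feat = torch.randn(2, 64, 32, 32)
class_text_emb = torch.randn(25 + 6, 512)                     # 25 organs + 6 tumor types
print(ClipDrivenHead()(feat, class_text_emb).shape)           # (2, 31, 32, 32)
```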
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, with the goal of preserving invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool for aiding image understanding that has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose a non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
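Schematically, the multi-task objective described above combines a pixel-restoration loss with a siamese feature-comparison loss at each pyramid scale; the equal loss weights, the cosine-similarity form, and the dummy tensors below are assumptions used only to show the shape of the computation, not PCRLv2's exact losses.

```python
# Schematic multi-scale SSL objective: pixel restoration + siamese comparison
# at each pyramid level. Loss forms and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def multiscale_ssl_loss(restored_pyramid, target_pyramid, feats_view1, feats_view2):
    loss = 0.0
    for restored, target, f1, f2 in zip(restored_pyramid, target_pyramid,
                                        feats_view1, feats_view2):
        # Pixel restoration: reconstruct the clean image at this scale.
        loss = loss + F.mse_loss(restored, target)
        # Siamese comparison: align features of the two augmented views.
        loss = loss - F.cosine_similarity(f1.flatten(1), f2.flatten(1)).mean()
    return loss

# Dummy two-scale pyramid just to show the shapes involved.
restored = [torch.randn(2, 1, 64, 64), torch.randn(2, 1, 32, 32)]
targets  = [torch.randn(2, 1, 64, 64), torch.randn(2, 1, 32, 32)]
f1 = [torch.randn(2, 128), torch.randn(2, 256)]
f2 = [torch.randn(2, 128), torch.randn(2, 256)]
print(multiscale_ssl_loss(restored, targets, f1, f2))
```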
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
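The masked-token training objective described above can be sketched as follows; the vocabulary size, mask rate, and tiny single-layer transformer are placeholders, not Muse's actual configuration.

```python
# Toy sketch of masked image-token modeling conditioned on a text embedding:
# randomly mask discrete image tokens and train to predict them. All sizes
# and the mask rate are placeholders.
import torch
import torch.nn as nn

vocab, seq_len, dim = 1024, 256, 128
tok_emb = nn.Embedding(vocab + 1, dim)                  # +1 for a [MASK] token id = vocab
text_proj = nn.Linear(512, dim)                         # stand-in for a frozen LLM text embedding
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
head = nn.Linear(dim, vocab)

image_tokens = torch.randint(0, vocab, (2, seq_len))    # discrete VQ tokens of an image
text_emb = torch.randn(2, 512)

mask = torch.rand(2, seq_len) < 0.5                     # randomly mask half the tokens
inputs = image_tokens.masked_fill(mask, vocab)          # replace masked positions with [MASK]

x = tok_emb(inputs) + text_proj(text_emb).unsqueeze(1)  # add text conditioning at every position
logits = head(layer(x))
loss = nn.functional.cross_entropy(logits[mask], image_tokens[mask])
print(loss)
```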
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
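A compact sketch of the dynamic feature selection loop: a selector network scores which feature to query next given the currently observed values and mask, and a predictor makes the final prediction. Network sizes, the masking convention, and the fixed budget are illustrative assumptions, and the training objective (amortized optimization toward the greedy CMI policy) is omitted.

```python
# Toy dynamic feature selection loop: a selector picks the next feature to
# query from the observed-so-far state; a predictor gives the final output.
import torch
import torch.nn as nn

num_features, num_classes, budget = 10, 3, 4
selector = nn.Sequential(nn.Linear(2 * num_features, 64), nn.ReLU(),
                         nn.Linear(64, num_features))        # scores per feature
predictor = nn.Sequential(nn.Linear(2 * num_features, 64), nn.ReLU(),
                          nn.Linear(64, num_classes))

x = torch.randn(5, num_features)                  # full feature vectors (hidden at test time)
observed = torch.zeros(5, num_features)           # 0/1 mask of queried features

for _ in range(budget):
    state = torch.cat([x * observed, observed], dim=1)       # observed values + mask
    scores = selector(state).masked_fill(observed.bool(), float("-inf"))
    next_feat = scores.argmax(dim=1)                          # greedy choice per example
    observed.scatter_(1, next_feat.unsqueeze(1), 1.0)         # "query" that feature

pred = predictor(torch.cat([x * observed, observed], dim=1)).argmax(dim=1)
print(observed.sum(dim=1), pred)   # each example queried `budget` features
```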
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.